NSF PAR Search | NSF Public Access Repository


DeWinder: Single-Channel Wind Noise Reduction using Ultrasound Sensing

https://doi.org/10.21437/Interspeech.2024-2180

Yuan, Kuang; Han, Shuo; Kumar, Swarun; Raj, Bhiksha (September 2024, ISCA)

Full Text Available
Metric from Human: Zero-shot Monocular Metric Depth Estimation via Test-time Adaptation

Zhao, Yizhou; Bian, Hengwei; Chen, Kaihua; Ji, Pengliang; Qu, Liao; Lin, Shao-yu; Yu, Weichen; Li, Haoran; Chen, Hao; Shen, Jun; et al. (December 2024, OpenReview)

Full Text Available
Synergistic Global-Space Camera and Human Reconstruction from Videos

https://doi.org/10.1109/CVPR52733.2024.00122

Zhao, Yizhou; Wang, Tuanfeng Yang; Raj, Bhiksha; Xu, Min; Yang, Jimei; Huang, Chun-Hao Paul (June 2024, IEEE)

Full Text Available
ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs For Audio, Music, and Speech

https://doi.org/10.1109/SLT61566.2024.10832289

Shi, Jiatong; Tian, Jinchuan; Wu, Yihan; Jung, Jee-Weon; Yip, Jia Qi; Masuyama, Yoshiki; Chen, William; Wu, Yuning; Tang, Yuxun; Baali, Massa; et al. (December 2024, IEEE)

Full Text Available
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

https://doi.org/10.1609/aaai.v37i3.25412

Yamazaki, Kashu; Vo, Khoa; Truong, Quang Sang; Raj, Bhiksha; Le, Ngan (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

Video Paragraph Captioning aims to generate a coherent, multi-sentence description of an untrimmed video that contains multiple temporally located events. Human perception understands a scene by decomposing it into visual components (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language. Following this process, we first propose a visual-linguistic (VL) feature in which the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to ensure that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is publicly available at: https://github.com/UARK-AICV/VLTinT.
Full Text Available
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation

https://doi.org/10.1007/s11263-022-01702-9

Vo, Khoa; Truong, Sang; Yamazaki, Kashu; Raj, Bhiksha; Tran, Minh-Triet; Le, Ngan (January 2023, International Journal of Computer Vision)

Full Text Available
